Windows Server 2008 : Ongoing Backup and Recovery Preparedness

11/19/2010 2:45:52 PM

Creating and documenting processes that detail how to properly back up and recover from a disaster is an essential step in a disaster recovery project. Periodically reviewing, validating, and updating these processes is equally important. Disaster recovery planning should not be considered a project for the current calendar year; instead, it should be considered an essential part of regular business operations and should have a dedicated annual budget and assigned staff.

Each year, many businesses, business divisions, or departments update their computer and network infrastructure and change the way they provide services to their staff, vendors, and clients. In many of these cases, the responsible information technology staff, cross-departmental managers, executives, and employees are not involved in, or properly informed of, these changes before they are implemented. Computer and network infrastructure changes can have ripple effects throughout an entire organization during transition and during disaster and failure situations, so proper planning and approval of changes should always be performed and documented.

To reduce the risk of a change negatively impacting business operations, many organizations implement processes that require new projects and system changes to be submitted, evaluated, and either approved or rejected based on the information provided. Although this chapter does not focus on, or even really discuss, project management, all organizations that use computer and network infrastructures should consider implementing a Project Management Office and a change-control committee to review and oversee organizational projects and infrastructure changes.

Project Management Office (PMO)

In recent years, many organizations have introduced Project Management Offices (PMOs) into their business operations. A PMO serves as a project oversight committee for organizations that frequently run several projects simultaneously. Organizations that use a proven project methodology can extend it to include workflow processes with checkpoints involving the PMO staff.

The role of the PMO differs from one organization to the next, but most include a few key functions. The PMO usually reviews proposed projects to determine whether, and how, the project deliverables coincide with the organization’s current or future business plans or strategies. PMO membership also varies among organizations and can include departmental managers, directors or team leads, executive staff, employee advocates, and, in some cases, board members. Having the PMO staff represent views and insight from the different levels and departments of an organization enables the PMO to add value to any proposed project.

Including staff from diverse roles in the PMO enables the organization to evaluate and understand current and proposed projects and how these projects will positively or negatively affect the organization as a whole. Some of the general functions or roles a PMO can provide include the following:

High-level project visibility— All proposed projects are presented to the PMO and if approved, the project is tracked by the PMO. This provides a single entity that is knowledgeable and informed about all ongoing and future projects in an organization and how they align to business and technical objectives.
Project sounding board— When a new project is proposed or presented to the PMO, the project will be scrutinized and many questions will be asked. Some of these questions might not have been considered during the initial project design and planning phases. The PMO improves project quality by reviewing and monitoring each project from the time it is proposed and throughout regularly scheduled project status and PMO meetings.
Committee-based project approval or denial— The PMO is informed of all the current and future projects, as well as business direction and strategy, and is the best-equipped group to decide on whether a project should be approved, denied, or postponed.
Enterprise project management— The PMO tracks the status of all ongoing projects and upcoming projects, which enables the PMO to provide additional insight and direction with regard to internal resource utilization, vendor management for outsourced projects, and, of course, project budget and scheduling.

Change Control

Whereas a PMO improves project management and can provide the necessary checkpoints to verify that backup and recovery requirements are addressed within the new projects, an organization with a change-control system can ensure that any proposed changes have been carefully evaluated and scheduled before approval or change execution. Change control involves a submittal, review, and approval process for each change that typically includes the following information:

Change description— Includes which systems will be changed, what the change is, and why it is proposed or required.
Impact of the change— Details if any systems or services will be unavailable during the execution of the change and who will be affected or impacted by the change.
Change duration— Details how long it will take to execute and complete the change and, if necessary, revert or roll back the change.
Change schedule— Includes the proposed date and time to execute the change.
Change procedure— Details how the change will be executed, including a detailed description; this usually also includes detailed steps or an accompanying document.
Change rollback plan— Details the steps necessary to recover or roll back the change in the event that the change causes undesirable results.
Change owners— Includes who will execute the change and is responsible for communicating the status and results of the change back to the change-control committee.
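The fields listed above map naturally onto a simple record. The following is only a minimal sketch in Python of how a change request could be captured and checked for completeness before it goes to the committee; the class name, field names, server name (FS01), and owner (jdoe) are hypothetical illustrations and do not come from any Windows Server tool or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical record of the information a change submission typically includes."""
    description: str              # which systems change, what the change is, and why
    impact: str                   # services unavailable during the change and who is affected
    duration: timedelta           # time needed to execute and complete the change
    rollback_duration: timedelta  # time needed to revert the change, if necessary
    scheduled_start: datetime     # proposed date and time to execute the change
    procedure: List[str]          # detailed steps for executing the change
    rollback_plan: List[str]      # steps to roll back if results are undesirable
    owners: List[str]             # who executes the change and reports the results

    def missing_fields(self) -> List[str]:
        """Return the names of text or list fields that were left empty."""
        required = ("description", "impact", "procedure", "rollback_plan", "owners")
        return [name for name in required if not getattr(self, name)]

    def worst_case_end(self) -> datetime:
        """Latest expected finish time if the change must also be rolled back."""
        return self.scheduled_start + self.duration + self.rollback_duration


# Illustrative submission; every value here is made up for the example.
example = ChangeRequest(
    description="Install updates on file server FS01 to close a security gap",
    impact="File shares unavailable to the finance department during the change",
    duration=timedelta(hours=1),
    rollback_duration=timedelta(minutes=30),
    scheduled_start=datetime(2010, 11, 20, 22, 0),
    procedure=["Back up the system state", "Install updates", "Reboot and verify shares"],
    rollback_plan=[],  # deliberately left empty to show the completeness check
    owners=["jdoe"],
)

print(example.missing_fields())
print(example.worst_case_end())
```

In this sketch, the example submission is flagged because its rollback plan is empty, which is the kind of gap a change-control committee would want closed before approving the change.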

A change-control committee, similar to a PMO, is made up of managers, executives, and employee advocates who will review and determine if the change is approved, denied, or needs to be postponed. Proposed changes are submitted in advance. A day or two later, a change-control review meeting is held where each change is discussed by the change-control committee and the change owner, and the change will be approved, denied, postponed, or closed, or more information will be requested.

During failure or disaster situations, going through the normal change-control process might not be an option due to the impact of the failure. During these situations, emergency change-request processes should be followed. An emergency change request usually involves getting the particular departmental manager and the responsible information technology manager, director, or CIO to sign off on the change before it is executed. In short, all changes need to be considered and approved, even in failure scenarios when time is of the essence. When an administrator is troubleshooting and trying to resolve a failure or trying to recover from a disaster, especially in a stressful situation, making changes without getting approval can lead to costly mistakes. Following the proper change-control and emergency change-control processes to inform and involve others, getting approval from management, and following documented processes will provide accountability and might even save the administrator’s job.

Disaster Recovery Delegation of Responsibilities

At this point, the organization might have a documented and functional backup and recovery plan, a PMO, and a change-control committee, but the ownership and maintenance of disaster recovery operations are not yet defined or assigned. Disaster recovery roles, functions, or responsibilities might be folded into an existing executive’s or manager’s duties, or a dedicated staff member might be required. Commonly, disaster recovery responsibilities are owned by the chief information officer, operations manager, chief information security officer, or a combination of these positions. Of course, responsibilities for different aspects of the overall disaster recovery plan are delegated to managers, departmental leads, and staff volunteers as necessary. An example of delegating disaster recovery responsibilities is contained in the following list:

The chief information officer is responsible for disaster recovery planning and maintaining and executing disaster recovery-related tasks for the entire telecom, desktop and server computer infrastructure, network infrastructure, and all other electronic and fax-related communication.
The manager of facilities or operations is responsible for planning alternate office locations and offsite storage of original or duplicates of all important paper documents, such as leases, contracts, insurance policies, stock certificates, and so on, to support disaster recovery operations to alternate sites or offices.
The manager of human resources is responsible for creating and maintaining emergency contact numbers for the entire company, storing this information offsite, and communicating with employees to provide direction and information prior to disasters striking and during a disaster recovery operation.

The list of responsibilities can be very granular and extensive, and disaster recovery planning should not be taken lightly or put on the back burner. Although there are many aspects of disaster recovery planning, the remainder of this chapter focuses only on the disaster recovery responsibilities and tasks that should be assigned to qualified Windows administrators who need to support a Windows Server 2008 R2 environment.

Achieving 99.999% Uptime Using Windows Server 2008 R2

When the topic of disaster recovery comes up, many people think of the phrase “five nines” or “99.999% uptime.” Although the concept is reasonably simple, actually providing five nines for a server or a network can be a large and expensive task. Achieving 99.999% uptime means that the server, application, network, or other service covered by the target can be down for only about 5.26 minutes per year (0.001% of the 525,600 minutes in a 365-day year). That is quite a claim to make, so administrators should make it with caution and document it, explicitly citing what the service depends on. For example, if the battery backups last only two hours, the documented dependency for a server would be that it can withstand a power outage of up to two hours.
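To make the arithmetic behind an uptime target concrete, the following short Python sketch converts an availability percentage into a yearly downtime budget. The function name is hypothetical, and the calculation assumes a 365-day year and that every minute of the year counts toward the target (no planned maintenance windows excluded).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


# Compare common targets, from "two nines" through "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is one reason the cost of meeting the target tends to rise so steeply.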

To provide 99.999% uptime for services available on Windows Server 2008 R2, administrators can build in redundancy and replication on a data, service, server, or site level. Many Windows Server 2008 R2 services outlined in other chapters of this book, including Failover Clusters, Network Load Balancing, and the Distributed File System, can provide redundancy for the specific services available.
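As a rough illustration of why redundancy helps, the Python sketch below estimates the availability of a service that stays up as long as at least one of several identical nodes is running. This is an idealized model, not a description of how Failover Clusters or Network Load Balancing behave: it assumes node failures are independent and that failover is instantaneous, and the function name is hypothetical.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a service that needs only one of `nodes` identical nodes.

    Assumes independent failures and instantaneous failover (an idealization).
    `node_availability` is a fraction between 0 and 1, for example 0.999.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The service is down only when every node is down at the same time.
    all_nodes_down = (1.0 - node_availability) ** nodes
    return 1.0 - all_nodes_down


for count in (1, 2, 3):
    print(f"{count} node(s) at 99.9% each: {combined_availability(0.999, count):.9f}")
```

Real deployments share dependencies such as power, storage, and network links, so actual availability will typically be lower than this model suggests; those shared dependencies are exactly what should be documented alongside any uptime claim.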